STATS 32 Session 5: Data Analysis Projects

Kenneth Tay

Oct 17, 2017

Recap of week 2

Syntax in R

The most important piece of syntax in R is the function call. Every other piece of R syntax, including operators such as + and <-, is a function call underneath.

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)

Syntax in R: example

x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA

Syntax in R: example

mean(x, na.rm = TRUE)
## [1] -1

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

dplyr syntax without %>%

Take the mtcars dataset, select just the wt and mpg columns:

library(dplyr)
data(mtcars)
select(mtcars, wt, mpg)

dplyr syntax without %>%

Take the mtcars dataset, select just the wt and mpg columns, then keep only the rows with mpg < 15.

R evaluates the calls from the inside out, but the code reads from the outside in.

filter(select(mtcars, wt, mpg), mpg < 15)

dplyr syntax with %>%

Take the mtcars dataset, select just the wt and mpg columns, then keep only the rows with mpg < 15:

mtcars %>% select(wt, mpg) %>% filter(mpg < 15)

dplyr syntax with %>%

mtcars %>% 
    select(wt, mpg) %>% 
    filter(mpg < 15)

Moral: dplyr can be used without %>%, but %>% makes code much more intuitive.
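As a quick check that the two styles really are interchangeable, the sketch below runs the nested version and the piped version and compares the results. The names nested and piped are made up for illustration.

```r
library(dplyr)

# Nested form: evaluated inside-out.
nested <- filter(select(mtcars, wt, mpg), mpg < 15)

# Piped form: x %>% f(y) is rewritten as f(x, y), so this is the same computation.
piped <- mtcars %>%
    select(wt, mpg) %>%
    filter(mpg < 15)

# Compare the two data frames.
same_result <- identical(nested, piped)
print(same_result)
```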

ggplot2 syntax

The + operator is meant to mimic how a data analyst thinks when making a plot: adding things to the plot one at a time.

library(ggplot2)
ggplot()

ggplot2 syntax

ggplot() +
    geom_point(data = mtcars, mapping = aes(x = wt, y = hp))

ggplot2 syntax

ggplot() +
    geom_point(data = mtcars, mapping = aes(x = wt, y = hp)) +
    labs(title = "Horsepower vs. Weight", x = "Weight", 
         y = "Horsepower")

ggplot2 syntax

ggplot() +
    geom_point(data = mtcars, mapping = aes(x = wt, y = hp)) +
    labs(title = "Horsepower vs. Weight", x = "Weight", 
         y = "Horsepower") +
    theme_classic()

ggplot2 syntax

Geometries need data and mappings. If these are not supplied inside a geometry's own parentheses, the geometry looks for them in the ggplot() call.

ggplot(data = mtcars, mapping = aes(x = wt, y = hp)) +
    geom_point() +
    labs(title = "Horsepower vs. Weight", x = "Weight", 
         y = "Horsepower") +
    theme_classic()
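One reason to put the data and mapping in ggplot() is that several geometries can then share them. The sketch below is an illustration only: the trend line via geom_smooth() is an addition that is not part of the original plot, and the name p is made up.

```r
library(ggplot2)

# Both geometries inherit data and mapping from the ggplot() call.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = hp)) +
    geom_point() +
    geom_smooth(method = "lm") +   # added for illustration: linear trend line
    labs(title = "Horsepower vs. Weight", x = "Weight",
         y = "Horsepower") +
    theme_classic()

# The plot is only drawn when it is printed, e.g. print(p).
```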

Agenda for today

“Official” cheat sheet for readr and tidyr available here.

Scripts in R

Working directories in R

Projects in R

Case study: Drought in California

First try: California’s Open Data Portal

Front page of data.ca.gov

There’s a water section! Looks promising…

In the middle of the 2nd page of results…

Drought statistics! But what’s with that description… Let’s click on it anyway…

CA Drought Monitor Basic Statistics

3 things stand out:

Always a good idea to preview before downloading. Let’s click on the “Preview”…

Huh???

Second try: Homepage URL

After a couple of clicks…

Looks like what we want!

Download the dataset

I’ve saved you the trouble by going through all these steps: you can download the csv file from Canvas (under Files, in the Session 5 folder).

Data come in all sorts of formats

Different packages for working with different data formats
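For csv files, one common choice is read_csv() from the readr package; readxl::read_excel() and haven::read_sas() play similar roles for Excel and SAS files. The sketch below writes a tiny made-up csv to a temporary file and reads it back. The column names and values are invented for illustration and are not the drought data.

```r
library(readr)

# Write a small, made-up csv file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,value",
             "a,1.5",
             "b,2.5"), csv_path)

# Read it back as a data frame (a tibble).
toy <- read_csv(csv_path)
print(toy)
```

For a real file, replace csv_path with the path to the downloaded file.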

tidyr verbs: gather and spread

gather: Used when some column names are not variables, but values of a variable

(Source: R for Data Science)
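A minimal sketch of gather on a made-up table, where the column names 1999 and 2000 are really values of a "year" variable. All data and names here are invented for illustration.

```r
library(tidyr)

# Wide table: the years are stored as column names.
wide_cases <- data.frame(country = c("A", "B"),
                         `1999` = c(10, 30),
                         `2000` = c(20, 40),
                         check.names = FALSE)

# Gather the two year columns into a (year, cases) pair of columns.
long_cases <- gather(wide_cases, key = "year", value = "cases",
                     `1999`, `2000`)
print(long_cases)
```

The result has one row per country-year combination, with the old column names stored in year.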

spread: Opposite of gather

(Source: R for Data Science)
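A minimal sketch of spread on a made-up table, where one column (type) holds variable names and another (count) holds their values. All data and names here are invented for illustration.

```r
library(tidyr)

# Long table: each observation is split across two rows.
long_tbl <- data.frame(country = c("A", "A", "B", "B"),
                       type = c("cases", "population",
                                "cases", "population"),
                       count = c(10, 1000, 20, 2000))

# Spread the values of `type` out into their own columns.
wide_tbl <- spread(long_tbl, key = type, value = count)
print(wide_tbl)
```

The result has one row per country, with cases and population as separate columns.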

tidyr verbs: separate and unite

separate: Used to separate values in one column into multiple columns

(Source: R for Data Science)
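A minimal sketch of separate on a made-up table, where a single rate column stores two numbers joined by a slash. All data and names here are invented for illustration.

```r
library(tidyr)

# One column holds "cases/population" as text.
rates <- data.frame(country = c("A", "B"),
                    rate = c("10/1000", "20/2000"))

# Split the rate column at the slash into two new columns.
split_rates <- separate(rates, rate,
                        into = c("cases", "population"), sep = "/")
print(split_rates)
```

The new columns are character by default; passing convert = TRUE asks separate to convert them to a more suitable type.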

unite: Opposite of separate

(Source: R for Data Science)
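A minimal sketch of unite on a made-up table, where a year has been split into a century column and a two-digit year column. All data and names here are invented for illustration.

```r
library(tidyr)

# The year is stored in two pieces.
pieces <- data.frame(country = c("A", "B"),
                     century = c("19", "20"),
                     year = c("99", "00"))

# Paste the two columns together with no separator.
joined <- unite(pieces, "full_year", century, year, sep = "")
print(joined)
```

By default unite drops the input columns and joins values with an underscore, which is why sep = "" is set here.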